256 ◾ Bioinformatics
tools proposed an extra step before taxonomic assignment. This step is meant to reduce the
potential errors that may produce noise; therefore, the step is called denoising.
7.2.2.2 Denoising
Two types of error can arise when deciding whether the variation within an OTU represents
errors or real diversity. The first type is the base-calling error, which may arise during
sequencing. Errors of this type may result from incorrect base pairing during PCR
amplification, from polymerase slippage, or from PCR chimeras, which form when DNA strand
extension is aborted during a PCR cycle and the aborted products act as primers in the
next cycle, producing artifacts. The second type is the misclassification of a read into an
incorrect taxonomic group. This error can be addressed by constructing OTUs at a particular
similarity threshold, such as 97%. However, that may come at the cost of taxonomic
sensitivity. Denoising attempts to handle these errors by using the reads to infer the
correct biological sequences, so that misclassification can be avoided.
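To make the sensitivity cost of a 97% threshold concrete, the following minimal sketch compares two pre-aligned sequences that differ at a single position. The sequences, the function name percent_identity, and the simple position-by-position identity measure are illustrative assumptions, not the clustering procedure of any particular tool.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Two hypothetical 100-nt amplicons that differ at exactly one position.
seq_a = "ACGT" * 25
seq_b = seq_a[:50] + "A" + seq_a[51:]  # position 50 is 'G' in seq_a, 'A' in seq_b

identity = percent_identity(seq_a, seq_b)

# At a 97% threshold both reads fall into the same OTU, so the
# single-nucleotide difference between them is no longer visible.
same_otu = identity >= 97.0
print(f"identity = {identity:.1f}%, same OTU at 97%: {same_otu}")
```

Under these assumptions the two variants are merged into one OTU, whether the difference is a sequencing error or a real biological variant. That ambiguity is what denoising tries to resolve.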
Several computational approaches have been proposed for sequence denoising. The most
commonly used are DADA2, Deblur, and UNOISE3, which infer error-free biological sequences
at single-nucleotide resolution. The inferred sequences, which are then used for taxonomic
assignment, are called features, zero-radius OTUs (ZOTUs), exact sequence variants (ESVs),
or amplicon sequence variants (ASVs). In the following, we discuss these three popular
denoising methods.
7.2.2.2.1 DADA2 Denoising
DADA2 (Divisive Amplicon Denoising Algorithm 2) [8] was adapted for use with Illumina
sequencing and is available as an open-source R package and as a plugin in QIIME2, which
is an open-source command-line Linux program. DADA2 implements a new model of
Illumina-sequenced amplicon errors that incorporates quality information on
within-sequence and between-sequence errors. The model quantifies the error rate
($\lambda_{ji}$) at which an amplicon read with sequence i is produced from sample
sequence j as a function of sequence composition and quality. The abundance of reads with
sequence i follows a Poisson distribution whose expected value equals the error rate
$\lambda_{ji}$ multiplied by the expected number of reads of sample sequence j.
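The Poisson abundance model can be illustrated with a small numerical sketch. The values of lambda_ji, reads_j, and observed_i are hypothetical, and the plain upper-tail probability computed here is a simplified stand-in for illustration, not necessarily the exact statistic that DADA2 evaluates.

```python
import math


def poisson_pmf(k: int, mu: float) -> float:
    """P(X = k) for X ~ Poisson(mu)."""
    return math.exp(-mu) * mu**k / math.factorial(k)


def poisson_upper_tail(k: int, mu: float) -> float:
    """P(X >= k) for X ~ Poisson(mu)."""
    return 1.0 - sum(poisson_pmf(x, mu) for x in range(k))


lambda_ji = 1e-4   # hypothetical error rate from sample sequence j to read sequence i
reads_j = 20_000   # hypothetical expected number of reads of sample sequence j

# Expected abundance of sequence i if it arises only as an error from j.
expected_i = lambda_ji * reads_j

observed_i = 12    # hypothetical observed abundance of sequence i
p_tail = poisson_upper_tail(observed_i, expected_i)

print(f"expected abundance of i: {expected_i:.2f}")
print(f"P(abundance >= {observed_i}) under the error model: {p_tail:.2e}")
```

A very small tail probability suggests that sequence i is too abundant to be explained as errors from sequence j alone, which is the intuition behind treating it as a separate biological sequence.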
The DADA2 model assumes that errors occur independently within a read and independently
between reads. The model then estimates the error rate as the product, over the L aligned
nucleotide positions, of the transition probabilities between the nucleotides, each
conditioned on the associated quality score, as follows:
\[
\lambda_{ji} = \prod_{l=0}^{L} p\bigl(j(l) \rightarrow i(l),\, q_i(l)\bigr) \tag{7.1}
\]
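Equation 7.1 can be sketched in a few lines of code. The function names error_rate and phred_transition are hypothetical, and the toy transition model below (a Phred-derived error probability split evenly over the three possible substitutions) is an assumption made only for illustration. DADA2 itself estimates its transition probabilities from the data.

```python
import math


def error_rate(sample_seq, read_seq, read_quals, trans_prob):
    """lambda_ji: product over aligned positions l of p(j(l) -> i(l), q_i(l))."""
    if not (len(sample_seq) == len(read_seq) == len(read_quals)):
        raise ValueError("sequences and quality scores must have equal length")
    rate = 1.0
    for j_nt, i_nt, q in zip(sample_seq, read_seq, read_quals):
        rate *= trans_prob(j_nt, i_nt, q)
    return rate


def phred_transition(j_nt, i_nt, q):
    """Toy model (assumption): error probability 10^(-q/10), shared by 3 substitutions."""
    p_err = 10 ** (-q / 10)
    return 1.0 - p_err if j_nt == i_nt else p_err / 3.0


sample_j = "ACGT"          # hypothetical sample sequence j
read_i = "ACGA"            # hypothetical read i, differing at the last position
quals_i = [30, 30, 30, 20] # hypothetical Phred quality scores of read i

lambda_ji = error_rate(sample_j, read_i, quals_i, phred_transition)
lambda_jj = error_rate(sample_j, sample_j, quals_i, phred_transition)

print(f"lambda_ji (one mismatch): {lambda_ji:.3e}")
print(f"lambda_jj (no mismatch):  {lambda_jj:.3e}")
```

Because the positions are multiplied together, a single mismatch at a low-quality position lowers the rate far less than a mismatch at a high-quality position would, which is how the quality scores enter the model.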